Distributed Information Management Schemes for 
Dynamic Allocation and De-allocation of Bandwidth 

RELATED APPLICATIONS 

This application is based on a Provisional Application, Serial No. 60/301,367, filed on June 27, 
2001, entitled "Distributed Information Management Schemes for Dynamic Allocation and 
De-allocation of Bandwidth." 

FIELD OF THE INVENTION 

This invention relates to methods for the management of network connections, providing 
dynamic allocation and de-allocation of bandwidth. 

BACKGROUND OF THE INVENTION 

Many emerging network applications, such as those used in wide-area collaborative science and 
engineering projects, make use of high-speed data exchanges that require reliable, high-bandwidth connections 
between large computing resources (e.g., storage with terabytes to petabytes of data, clustered supercomputers 
and visualization displays) to be dynamically set up and released. To meet the requirements of these applications 
economically, a network must be able to quickly provision bandwidth-guaranteed survivable connections (i.e., 
connections with sufficient protection against possible failures of network components). 

In such a high-speed network, a link (e.g., an optical fiber) can carry up to a few terabits per second. Such 
a link may fail due to human error, software bugs, hardware defects, natural disasters, or even through deliberate 
sabotage by hackers. As our national security, economy and even day-to-day life rely more and more on computer 
and telecommunication networks, avoiding disruptions to information exchange due to unexpected failures has 
become increasingly important. 

To avoid these disruptions, a common approach, called shared mesh protection or shared path protection, 
is to protect connections carrying critical information from the failure of a single link or node. The scheme is as follows: 
when establishing a connection (the "active connection") along a path (the "active path") between an ingress and 
an egress node, another link-disjoint (or node-disjoint) path (the "backup path"), which is capable of establishing 
a backup connection between the ingress and egress nodes, is also determined. Upon failure of the active path, the 
connection is re-routed immediately to the backup path. 

Note that in shared path protection, a backup connection does not need to be established at the same time 
as its corresponding active connection; rather, it can be established and used to re-route the information carried by 
the active connection after the active connection fails (and before the active connection can be restored). After the 
link/node failure is repaired and the active connection re-established, the backup connection can be released. 
Because it is assumed that only one link (or node) will fail at any given time (i.e., no additional failures will occur 
before the current failure is repaired), backup connections corresponding to active connections that are 
link-disjoint (or node-disjoint) will never need to be established at the same time in response to any single link 
(node) failure. Thus, even though these backup connections may use the same link, they can share bandwidth on 
that common link. 

As an example of bandwidth sharing among the backup connections, consider two connection 
establishment requests, each represented by a tuple $(s_k, d_k, w_k)$, where $s_k$ is the ingress node, $d_k$ the egress 
node, and $w_k$ the amount of bandwidth required to carry information from $s_k$ to $d_k$, for $k = 1$ and $2$, 
respectively. As shown in Figure 1, since the two active paths A1 and A2 do not share any links or nodes, the 
amount of bandwidth needed on a link common to the two backup paths B1 and B2, such as $l$, is 
$\max\{w_1, w_2\}$ (not $w_1 + w_2$). Such bandwidth sharing allows a network to operate more efficiently. More 
specifically, without taking advantage of such bandwidth sharing, additional bandwidth is required to establish the 
same set of connections; conversely, fewer connections can be established in a network with the same (and 
limited) bandwidth. 

In order to determine whether or not two or more backup connections can share bandwidth on a common 
link, one needs to know whether or not their corresponding active connections are link (or node) disjoint. This 
information is readily available when a centralized control is used. A network-wide central controller processes 
every request to establish/tear-down a connection, and thus can maintain and access information on complete 
paths and/or global link usage. However, centralized controls are neither robust nor scalable as the central 
controller can become another point of failure or a performance bottleneck. In addition, the amount of 
information that needs to be maintained is also enormous when the problem size (i.e., network size and/or number 
of requests) is large. Finally, no polynomial time algorithms exist to effectively obtain optimal bandwidth 
sharing, and Integer Linear Programming (ILP) based methods are very time consuming for a large problem size. 

The following three schemes, all under centralized control, have been proposed. In each scheme, it is 
assumed that a central controller knows the network topology as well as the initial link capacity (i.e., $C_a$ for every 
link a). 

To aid our discussion, the following acronyms and abbreviations will be used: 
NS: No Sharing 

SCI: Sharing with Complete Information 

SPI: Sharing with Partial Information 

(S)SR: (Successive) Survivable Routing 

DCIM: Distributed Complete Information Management 

DPIM: Distributed Partial Information Management 

DPIM-SAM: DPIM with Sufficient cost estimation, Aggressive cost estimation and Minimum bandwidth 
allocation 

WDM: wavelength-division multiplex (or multiplexed) 

MPLS: Multi-protocol label switching 

MPλS: Multi-protocol Lambda (i.e., wavelength) switching 

E: set of directed links in a network (or graph) N. The number of links is $|E|$. 

V: set of nodes in a network. It includes a set of edge nodes $V_e$ and a set of core nodes $V_c$. The number of 
nodes is $|V| = |V_e| + |V_c|$. 

$C_e$: Capacity of link e. 

$A_e$: Set of connections whose active paths traverse link e. 

$F_e = \sum_{k \in A_e} w_k$: Total amount of bandwidth on link e dedicated to all active connections traversing link e. 
Each such connection is protected by a backup path. 

$B_e$: Set of connections whose backup paths traverse link e. 

$G_e$: Total amount of bandwidth on link e that is currently reserved for all backup paths traversing link e. 
Note that, without any bandwidth sharing, $G_e = \sum_{k \in B_e} w_k$, and with some bandwidth sharing, $G_e$ will be 
less (as discussed later). 

$R_e$: Residual bandwidth on link e. If all connections need to be protected, $R_e = C_e - F_e - G_e$ (see the extension 
to the case where unprotected and/or pre-emptable connections are allowed for more discussion). 

$S_a^b = A_a \cap B_b$: Set of connections whose active paths traverse link a and whose backup paths traverse link b. 

$\delta_a^b = \sum_{k \in S_a^b} w_k$: Total (i.e., aggregated) amount of bandwidth required by the connections in $S_a^b$. Note that 
$\delta_a^b \le F_a$. This is the amount of bandwidth on link a dedicated to the active paths for the connections in $S_a^b$. It is 
also the amount of bandwidth that needs to be reserved on link b for the corresponding backup paths and that may 
be shared by other backup paths. 

$Q_a^b$: Cost of traversing link b by a backup path for a new connection (in terms of the amount of additional 
bandwidth to be reserved on link b) when the corresponding active path traverses link a. 

G(b): Set of $\delta_a^b$ values, one for each link a. 

$\overline{G}_b = \max_{\forall a} \delta_a^b$: Minimum (or necessary) amount of bandwidth that needs to be reserved on link b to back up 
all active paths, assuming maximum bandwidth sharing is achieved. 

F(a): Set of $\delta_a^b$ values, one for each link b. 

$\overline{F}_a = \max_{\forall b} \delta_a^b$: Maximum (or sufficient) amount of bandwidth that needs to be reserved on any link, over all 
the links in a network, in order to back up the active paths currently traversing link a. 
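
To make these definitions concrete, the following minimal Python sketch aggregates the per-link quantities from a list of connections. The record layout (active links, backup links, bandwidth), the link names and the function name `link_usage` are assumptions made purely for illustration; they do not appear in the schemes discussed here.

```python
from collections import defaultdict

def link_usage(connections):
    """Aggregate per-link quantities from (active_links, backup_links, w) records."""
    F = defaultdict(int)       # F[e]: bandwidth dedicated to active paths on link e
    G_ns = defaultdict(int)    # G_ns[e]: backup bandwidth on e without any sharing
    delta = defaultdict(int)   # delta[(a, b)]: bandwidth active on a and backed up on b
    for active, backup, w in connections:
        for a in active:
            F[a] += w
        for b in backup:
            G_ns[b] += w
            for a in active:
                delta[(a, b)] += w
    G_min = defaultdict(int)   # G_min[b]: largest delta[(a, b)] over all links a
    for (a, b), value in delta.items():
        G_min[b] = max(G_min[b], value)
    return F, G_ns, delta, G_min

# Two hypothetical connections with disjoint active paths and a common backup link "l"
conns = [(["a1"], ["l"], 3), (["a2"], ["l"], 5)]
F, G_ns, delta, G_min = link_usage(conns)
print(G_ns["l"], G_min["l"])  # 8 5
```

With maximum sharing, link "l" needs only the larger of the two bandwidths (5 units) rather than their sum (8 units), because a single failure can affect only one of the two active paths.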

In the prior-art No-Sharing scheme, no additional information needs be maintained by the central 
controller. As the name suggests, there is no bandwidth sharing among the backup connections when using this 
scheme. 

The NS scheme works as follows. For every connection establishment request, the controller tries to find 
two link-disjoint (or node-disjoint) paths meeting the bandwidth requirement specified by the request. Since the 
amount of bandwidth consumed on each link along both the active and backup paths is $w_k$ units, the problem of 
minimizing the total amount of bandwidth consumed by the new connection is equivalent to that of determining a 
pair of link-disjoint or node-disjoint paths for which the total number of links involved is minimum. 
Consequently, the problem can be solved using minimum cost flow algorithms such as the one described in the 
Liu, Tipper, and Siripongwutikorn reference. 

Although the NS scheme is simple to implement, it is very inefficient in bandwidth utilization. 

In another prior art scheme, termed Sharing with Complete Information (SCI), the centralized controller 
maintains the complete information on all existing active and backup connections in a network. More specifically, 
for every link e, both $A_e$ and $B_e$ are maintained, from which other parameters such as $F_e$ and $G_e$ can be 
determined. 

With SCI, the problem of minimizing the total bandwidth consumed to satisfy the new connection request 
may be solved based on the following Integer Linear Programming (ILP) formulation, as modified from the 
Kodialam and Lakshman reference. Assume that the active and backup paths for a new connection establishment 
request which needs $w$ units of bandwidth will traverse links a and b, respectively. In SCI, one can determine that 
the amount of bandwidth that needs to be reserved on link b is $\delta_a^b + w$. Since the amount of bandwidth already 
reserved on link b for backup paths is $G_b$ (which is sharable), we have 

$$
Q_a^b =
\begin{cases}
\infty & \text{if } a = b \text{ or } R_a < w \text{ or } \delta_a^b + w - G_b > R_b \quad \text{(i)} \\
0 & \text{else if } \delta_a^b + w \le G_b \quad \text{(ii)} \\
\delta_a^b + w - G_b & \text{else if } \delta_a^b + w > G_b \text{ and } \delta_a^b + w - G_b \le R_b \quad \text{(iii)}
\end{cases}
$$

In the above equation, (i) states the constraint that the same link cannot be used by both the active and 
backup paths, and even if a and b are different links, they cannot be used if the residual bandwidth on either link is 
insufficient; further, (ii) and (iii) state that the new backup path can share the amount of bandwidth already 
reserved on link b. More specifically, (ii) states that no additional bandwidth on link b needs to be reserved in order 
to protect link a, and (iii) states that at least some additional bandwidth on link b should be reserved. 

To facilitate the ILP formulation, consider a graph N with a set of vertices (or nodes) V and a set of 
directed edges (or links) E. Let vector x represent the active path for the new request, where $x_e$ is set to 1 if link e 
is used in the active path and 0 otherwise. Clearly, on each link e whose $x_e = 1$ in the final solution, $w$ units of 
additional bandwidth need to be dedicated. Similarly, let the vector y represent the backup path for the new 
request, where $y_e$ is set to 1 if link e is used on the backup path and 0 otherwise. In addition, let $z_e$ be the 
additional amount of bandwidth to be reserved on link e for the backup path in the final solution. Clearly, $z_e$ must 
be 0 if $y_e = 0$ in the final solution. Finally, let h(n) be the set of links originating from node n, and t(n) the set of 
links ending at node n. 

The objective of the ILP formulation is to determine active and backup paths (or equivalently, vectors x 
and y) such that the following cost function is minimized: 

$$w \sum_{e \in E} x_e + \sum_{e \in E} z_e$$

subject to the following constraints: 

$$z_b \ge Q_a^b \, (x_a + y_b - 1) \quad \forall a, \forall b$$

$$x_e, y_e \in \{0, 1\}$$

and 

$$z_e \ge 0$$
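
For x and y to describe paths from the ingress node to the egress node, a formulation of this kind also needs flow-conservation constraints. A standard form is sketched below, offered only as an assumption that is consistent with the definitions of h(n) and t(n); the symbols $s$ and $d$ denote the ingress and egress nodes of the new request.

```latex
\sum_{e \in h(n)} x_e - \sum_{e \in t(n)} x_e =
\begin{cases}
 1 & \text{if } n = s \\
-1 & \text{if } n = d \\
 0 & \text{otherwise}
\end{cases}
\qquad \forall n \in V

\sum_{e \in h(n)} y_e - \sum_{e \in t(n)} y_e =
\begin{cases}
 1 & \text{if } n = s \\
-1 & \text{if } n = d \\
 0 & \text{otherwise}
\end{cases}
\qquad \forall n \in V
```

Under these constraints, one unit of flow leaves the ingress node and arrives at the egress node in each of x and y, while the infinite cost assigned when a = b keeps the two paths link-disjoint.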

As mentioned earlier, such a scheme allows the new backup path to share maximum bandwidth with other 
existing backup paths but has two major drawbacks that make it impractical for a large problem size. One is the 
total amount of information (i.e., $A_e$ and $B_e$ for every link e) that needs to be maintained (which is $O(L \cdot |V|)$, 
where L is the number of connections and $|V|$ is the number of nodes in a network), as well as the overhead 
involved in updating such information for every request (which is $O(|V|)$). These will likely impose too much of a 
burden on a central controller. The other is that the maximum bandwidth sharing comes at the price of solving the 
ILP formulation, which contains many variables and constraints and thus incurs a high computational overhead. 
For example, processing one connection establishment request in a 70-node network takes about 10-15 minutes 
on a low-end workstation. 

Another prior art scheme we will discuss is called Sharing with Partial Information (SPI). In this scheme, 
only the values of $F_e$ and $G_e$ (from which $R_e$ can be easily calculated) for every link e are maintained by the 
central controller. 

For SPI, an ILP formulation similar to the one described above can be used. More specifically, one can 
replace $\delta_a^b$ with $F_a$ in the equation for $Q_a^b$ (see the Kodialam and Lakshman reference), which gives 

$$
Q_a^b =
\begin{cases}
\infty & \text{if } a = b \text{ or } R_a < w \text{ or } F_a + w - G_b > R_b \quad \text{(i')} \\
0 & \text{else if } F_a + w \le G_b \quad \text{(ii')} \\
F_a + w - G_b & \text{else if } F_a + w > G_b \text{ and } F_a + w - G_b \le R_b \quad \text{(iii')}
\end{cases}
$$

This is a conservative approach, as $F_a \ge \delta_a^b$, $\forall b$. A quicker method which obtains a near-optimal solution 
for SPI in about 1 second was also suggested in the Kodialam and Lakshman reference. 

While the ILP formulation takes as much time to solve as in SCI, SPI achieves less bandwidth sharing 
(and thus lower bandwidth utilization) than SCI, which is the price paid for maintaining only partial information 
(and thus reducing book-keeping overhead). 

The final prior-art schemes we will discuss are the so-called Survivable Routing (SR) and Successive 
Survivable Routing (SSR). In these schemes, instead of maintaining complete path (or per flow) information as in 
SCI, global link usage (or aggregated) information is maintained. More specifically, in the distributed 
implementation proposed by the Liu, Tipper, and Siripongwutikorn reference, every (ingress) node maintains a 
matrix of $\delta_a^b$ for all links a and b. Also, for every connection establishment request, an active path is found first 
using shortest path algorithms. Then, the links used by the active path are removed, each remaining link is 
assigned a cost equal to the additional bandwidth required based on the matrix of $\delta_a^b$, and a cheapest backup path 
is chosen. After that, the matrix of $\delta_a^b$ is updated and the updated values are broadcast to all other nodes using 
Link State Advertisements (LSAs). 

The main difference between SR and SSR is that, in the latter, existing backup paths may change (in the 
way they are routed as well as the amount of additional bandwidth reserved) after the matrix of $\delta_a^b$ is updated 
(e.g., as a result of setting up a new connection). 

While it has been mentioned in the Kodialam and Lakshman reference that the NS, SPI and SCI schemes 
described earlier are amenable to implementation under distributed control, no details of a distributed control 
implementation of any of these schemes have been provided. 

Further, even though the Liu, Tipper, and Siripongwutikorn reference provides a glimpse of how paths 
(active and backup) can be determined, and how the matrix of $\delta_a^b$ can be exchanged under distributed control in 
SR and SSR, no details on signaling (i.e., how to set up paths) are provided. In addition, every node needs to 
maintain $O(|E|^2)$ information, which is still a large amount and requires a high signaling and book-keeping 
overhead. In fact, in a WDM network where each request is for a lightpath (which occupies an entire wavelength 
channel on each link it spans), maintaining the complete path information (i.e., $A_e$ and $B_e$) as in SCI may not be 
worse than maintaining the matrix of $\delta_a^b$. 

Therefore, an object of the instant invention is to provide an improved distributed control implementation 
where each controller needs only partial ($O(|E|)$) information. 

It is another object to address the handling of connection release requests (specifically, de-allocating 
bandwidth reserved for backup paths), which is not addressed in any prior art, especially under distributed control 
and with partial information. (In NS, bandwidth de-allocation on backup paths is trivial, but in SCI (or SR/SSR) it 
incurs a large computing, information updating and signaling overhead.) It is a related object to provide a scheme 
that de-allocates bandwidth effectively under distributed control with only partial information. (In SPI, 
de-allocation of bandwidth along the backup path upon a connection release is impossible.) 

Performance evaluation results have shown that in a 15-node network, after establishing a couple of 
hundred connections, SPI results in about 16% bandwidth saving when compared to NS, while SCI (SR, SSR) 
can achieve up to 37%. It is a further object of the invention to provide distributed control schemes based on 
partial information that can achieve up to 32% bandwidth savings. 

SUMMARY OF THE INVENTION 

In order to achieve the above objects, the invention presents distributed control methods for on-line 
dynamic establishment and release of protected connections which achieve a high degree of bandwidth sharing 
with low signaling and processing overheads and having distributed information maintenance. Efficient 
distributed control methods will be presented to determine paths, maintain and exchange partial information, 
handle connection release requests and increase bandwidth sharing with only partial information. 

In the following discussion, it is assumed that connection (establishment or release) requests arrive one at 
a time, and when each request is processed, no prior knowledge about future requests is available. In addition, 
once the path taken by an active connection and the path selected by the corresponding backup connection are 
determined, they will not change during the lifetime of the connection. Further, it is first assumed that all 
connections are protected, and then the extension to accommodate unprotected and pre-emptable connections will 
be discussed further below. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Figure 1 is an example showing backup paths and bandwidth sharing among backup paths. 

Figure 2 shows a base graph of a directed network in which there is no existing connection at the 
beginning. 

Figure 3(1) shows that a connection from node A to node D with w=5 has been established, using link $e_6$ on its 
active path and link $e_5$ on its backup path. 

Figure 3(2) shows another connection from C to D with w=5 being established. 

Figure 3(3) shows that using the simplest form of DPIM, six additional units of backup bandwidth are 
required on link $e_7$. 

Figure 3(3') shows that using DPIM-S, only one additional unit is required. 

Figure 4 shows Hop-by-hop Allocation of Minimum Bandwidth (or the M approach). 

Figure 4(1) shows the bandwidth allocated after connection A to D is established. 

Figure 4(2) shows the bandwidth allocated after connection C to D is established. 

Figure 4(3) shows that using an ordinary method, one additional unit of bandwidth is needed on $e_7$ for the 
new connection B to D. 

Figure 4(3') shows that using the minimum allocation method, no additional bandwidth is needed on $e_7$ 
for connection B to D. 

DETAILED DESCRIPTIONS OF THE PREFERRED EMBODIMENTS 

Under distributed control, when a connection establishment request arrives, a controller (e.g. an ingress 
node) can specify either the entire active and backup paths from the ingress node to the egress node as in explicit 
routing, or just two adjacent nodes to the ingress node, one for each path to go through next (where another routing 
decision is to be made) as in hop-by-hop routing. A compromise, called partially explicit routing, is also possible 
where the ingress node specifies a few but not all nodes on the two paths, and it is up to these nodes to determine 
how to route from one node to another (possibly in a hop-by-hop fashion). 

In the following discussion on the novel schemes based on what we will call "Distributed Partial 
Information Management (DPIM)", it is assumed that each request (to either establish or tear-down a connection) 
arrives at its ingress node, and every edge node (which is potentially an ingress node) acts as a controller that 
performs explicit routing. Most of the concepts to be discussed also apply to the case with only one such 
controller (as in centralized control). The same concepts also apply to the case with one or more controllers that 
perform hop-by-hop routing or partially explicit routing. 

In addition, we will assume that each edge node (and in particular, potential ingress node) maintains the 
topology of the entire network by, e.g., exchanging link state advertisements (LSAs) among all nodes (edge and 
core nodes) as in OSPF. These edge nodes may exchange additional information using extended LSAs, or 
dedicated signaling protocols, depending on the implementation. 

Information Maintenance 

In DPIM, each node n (edge or core) maintains $F_e$, $G_e$ and $R_e$ for all links $e \in h(n)$ (which is very little 
information, though one may reduce it further, e.g., by eliminating $F_e$). 

What is novel and unique about DPIM is that each edge (ingress) node maintains only partial information 
on the existing paths. More specifically, just as a central controller in SPI, it maintains only the aggregated link 
usage information such as $F_e$, $G_e$ and $R_e$ for all links $e \in E$. Any updates on such information only need be 
exchanged among different nodes (and in particular, ingress nodes), as described below. 

In addition, each node (edge or core) also maintains a set of $\delta_a^e$ values for every link e 
originating from the node. More specifically, for each outgoing link $e \in h(n)$ at node n, node n maintains (up 
to) $|E|$ entries, one for each link a in the network. Each entry contains the value of $\delta_a^e$ for link $a \in E$ (note that 
one may use a linked list to maintain only those entries whose $\delta_a^e > 0$). Since any given node has a bounded nodal 
degree d (i.e., the number of neighboring nodes and hence of outgoing links), the amount of information that needs 
to be maintained is $O(d \cdot |E|)$, which is independent of the number of connections in a network. Based on this set 
of $\delta_a^e$ values (which is denoted by G(e)), $\overline{G}_e$ can be determined ($\overline{G}_e = \max_{\forall a} \delta_a^e$). This information is 
especially useful for de-allocating bandwidth effectively upon receiving a connection tear-down request, and need 
not be exchanged among different nodes. 

In other embodiments of the invention, DPIM implementations can be enhanced to carry additional 
information maintained by each node. For example, in what we will call DPIM-A (where A stands for Aggressive 
cost estimation), each node n maintains a set of $\delta_e^b$ values, denoted by F(e), for each link $e \in h(n)$. The set F(e) 
(as a complement to the set G(e) described above) contains (up to) $|E|$ entries of $\delta_e^b$, one for each link b in the 
network (note that again, one may use a linked list to maintain only those entries whose $\delta_e^b > 0$). This information 
is used to improve the accuracy of the estimated cost function and need not be exchanged among different nodes. 
In addition, each ingress node maintains $\overline{F}_e$ (instead of $F_e$), where $\overline{F}_e = \max_{\forall b} \delta_e^b$, for all links $e \in E$. Just as 
with $G_e$ and $R_e$, any updates to $\overline{F}_e$ need to be exchanged among ingress nodes. 

In all cases, the amount of information maintained by an edge (or core) node is $O(d \cdot |E|)$, where d is the 
number of outgoing links and is usually small when compared to $|E|$. In addition, the amount of information that 
need be exchanged after a connection is set up or released is $O(|E|)$. 

Path Determination 

In the preferred basic implementation of DPIM, an ingress node determines the active and backup paths 
using the same Integer Linear Programming formulation as described earlier in our discussion of the prior-art SPI 
scheme (in particular, note equations (i'), (ii') and (iii') for the cost estimation function). One can improve the ILP 
formulation (which affects the performance only slightly) by using the following objective function instead: 

where $\varepsilon$ (< 1) is set to 0.9999 in our simulation. One may also protect a connection from a single node 
failure by transforming the graph N representing the network using a common node-splitting approach described 
in the Suurballe and Tarjan reference, and then applying the same constraints as those used for ensuring 
link-disjoint paths. 

Note that if the ingress node fails to find a suitable pair of paths because of insufficient residual 
bandwidth, for example, the connection establishment request will be rejected. Such a request, if submitted after 
other existing connections have been released, may be satisfied. 

The two following methods can be used to improve the accuracy of the estimation of the cost of a backup 
path, and in turn, select a better pair of active and backup paths. 

One is called DPIM-S, where S stands for Sufficient bandwidth estimation. In DPIM-S, equation (iii') 
becomes $Q_a^b = \min\{F_a + w - G_b, w\}$ (instead of $Q_a^b = F_a + w - G_b$); one should also replace $F_a + w - G_b$ in 
the conditions of equations (i') and (iii') with $\min\{F_a + w - G_b, w\}$. 

An example showing the improvement due to DPIM-S is as follows. Consider the directed network shown 
in Figure 2, where there are no existing connections in the beginning. Now assume that a connection from node A 
to node D with w=5 has been established, using link $e_6$ on its active path and link $e_5$ on its backup path, as shown 
in Figure 3(1). Thereafter, another connection from C to D with w=5 has been established, as shown in 
Figure 3(2). In order to establish the third connection from B to D with w=1, DPIM needs to allocate 6 additional 
units of bandwidth on link $e_7$ as in Figure 3(3), but DPIM-S only needs to allocate 1 additional unit as in 
Figure 3(3'). 
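
The difference between the two estimates can be sketched in a few lines of Python. The function name `backup_cost`, its argument order and the numeric values below are assumptions for illustration only: an aggregate active bandwidth of 10 on link a and 5 units already reserved on the backup link are chosen to reproduce the 6-versus-1 outcome described above, since the per-link values of the figure are not given in the text.

```python
import math

def backup_cost(F_a, G_b, R_a, R_b, w, a, b, sufficient=False):
    """Estimated extra bandwidth on link b when the active path uses link a.

    sufficient=False follows cases (i')-(iii'); sufficient=True caps the
    estimate at w, as in DPIM-S.
    """
    need = F_a + w - G_b
    if sufficient:
        need = min(need, w)      # a new connection never needs more than w
    if a == b or R_a < w or need > R_b:
        return math.inf          # case (i'): this pair of links cannot be used
    return max(need, 0)          # case (ii') gives 0, case (iii') gives `need`

# Hypothetical values: F_a = 10, G_b = 5, ample residual bandwidth, w = 1
print(backup_cost(10, 5, 100, 100, 1, "a", "e7"))                   # 6
print(backup_cost(10, 5, 100, 100, 1, "a", "e7", sufficient=True))  # 1
```

The cap at w reflects the fact that adding one connection of bandwidth w can raise the reservation needed on any backup link by at most w, regardless of how pessimistic the partial-information estimate is.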

The other is called DPIM-A (where A stands for Aggressive cost estimation). In DPIM-A, equation (iii') 
becomes $Q_a^b = \overline{F}_a + w - G_b$, where $\overline{F}_a = \max_{\forall b} \delta_a^b$ (one should also replace $F_a$ with $\overline{F}_a$ in the conditions for 
equations (i') through (iii')). Because $F_a \ge \overline{F}_a \ge \delta_a^b$, such an estimate is closer to the actual cost that would 
be incurred if SCI were used. 

In another embodiment, the above two cost estimation methods can be combined into what we call 
DPIM-SA, where equation (iii') becomes 

$$Q_a^b = \min\{\overline{F}_a + w - G_b,\ w\}$$

The above backup cost estimation may lead to long backup paths, and thus a longer recovery time, as some 
links may have zero backup cost. An improvement therefore is to use the following cost estimation instead of 
equations (ii') and (iii'): 

$$\min\{\max_{\forall a \in A}(F_a + w - G_b,\ \mu w),\ w\}$$

so that every usable link has a backup cost of at least $\mu w$. 

The above cost estimation technique can be used in conjunction with the modified objective function 
stated at the beginning of this subsection to yield solutions that not only are bandwidth efficient but also can 
recover faster because of shorter backup paths. 

In order to determine paths quickly and efficiently, we propose a novel heuristic algorithm called Active 
Path First (APF), as follows. Assume that DPIM-S is used. APF first removes the links e whose $R_e$ is less than $w$ 
from the graph N representing the network, then finds the shortest path (in terms of number of hops) for use as the 
active path, denoted by A. It then removes the links $a \in A$ from the original graph N and calculates, for each 
remaining link b, $\min\{F_A + w - G_b, w\}$, where $F_A = \max_{\forall a \in A} F_a$. If this value exceeds $R_b$, the link b is removed 
from the graph. Otherwise, it is assigned to the link b as a cost. Finally, a cheapest path is found as the backup path. 

If DPIM-SA is used, one can simply replace $F_a$ with $\overline{F}_a$ (in which case $F_A = \max_{\forall a \in A} \overline{F}_a$). 
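
The APF steps with the DPIM-S estimate can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the function names `apf` and `cheapest_path`, the dictionary-based graph representation and the five-link example network are all hypothetical, and the link cost is clamped at zero in line with case (ii') so that Dijkstra's algorithm remains applicable.

```python
import heapq

def cheapest_path(links, cost, src, dst):
    """Dijkstra restricted to the links present in `cost`; returns link ids or None."""
    out = {}
    for e in cost:
        out.setdefault(links[e][0], []).append(e)
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, n = heapq.heappop(heap)
        if d > dist[n]:
            continue                      # stale heap entry
        for e in out.get(n, []):
            m, nd = links[e][1], d + cost[e]
            if nd < dist.get(m, float("inf")):
                dist[m], prev[m] = nd, e
                heapq.heappush(heap, (nd, m))
    if dst not in dist:
        return None
    path, n = [], dst
    while n != src:                       # walk back along predecessor links
        path.append(prev[n])
        n = links[prev[n]][0]
    return path[::-1]

def apf(links, F, G, R, src, dst, w):
    """Active Path First; links maps link id -> (tail node, head node)."""
    # Step 1: shortest-hop active path over links with enough residual bandwidth.
    hops = {e: 1 for e in links if R[e] >= w}
    active = cheapest_path(links, hops, src, dst)
    if active is None:
        return None                       # request rejected
    # Step 2: remove the active links, then price each remaining link b.
    F_A = max(F[a] for a in active)
    cost = {}
    for b in links:
        if b in active:
            continue
        c = max(min(F_A + w - G[b], w), 0)
        if c <= R[b]:                     # otherwise link b is removed
            cost[b] = c
    # Step 3: cheapest backup path under the estimated costs.
    backup = cheapest_path(links, cost, src, dst)
    return None if backup is None else (active, backup)

# Hypothetical network: direct link e1 plus two two-hop detours from S to T
links = {"e1": ("S", "T"), "e2": ("S", "X"), "e3": ("X", "T"),
         "e4": ("S", "Y"), "e5": ("Y", "T")}
F = dict.fromkeys(links, 0)
G = dict.fromkeys(links, 0)
G["e4"] = G["e5"] = 2                     # backup bandwidth already reserved here
R = dict.fromkeys(links, 10)
result = apf(links, F, G, R, "S", "T", 2)
print(result)  # (['e1'], ['e4', 'e5'])
```

In this example the detour through Y is preferred for the backup path because its links already hold sharable backup bandwidth, so the estimated additional cost there is zero.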

In another embodiment, we propose to logically remove all links whose residual bandwidth is less than $w$, 
and then find a shortest pair of paths; the shorter of the two shall be the active path and the other the backup path, 
along which the minimum amount of backup bandwidth will be allocated using the method to be described below. 

We also propose a family of APF-based heuristics which take into account the potential backup cost (PBC) 
when determining the active path. The basic idea is to assign each link a cost of $w + B(w)$, where $B(w)$ can be 
defined as follows: 

$$B(w) = c \cdot w \cdot \frac{F_e}{M}$$

where c is a small constant, for example between 0 and 1, and M is the maximum value of $F_e$ over all links e. 

Alternatively, other PBC functions can be used which return a non-zero value that is usually proportional 
to $w$ and $F_a$. One such example is $B(w) = w \cdot e^{\lambda F_a / M}$, where $\lambda$ is also a small constant. 

To maintain a minimum amount of partial information and require minimum changes to the existing 
routing mechanisms employed by the Internet Protocol (IP), we also propose to remove all remaining links with less 
than $w$ units of residual bandwidth and to assign each eligible link a cost of $w$ before applying any shortest-path 
algorithm to find the backup path. This approach can also be bandwidth efficient as long as backup bandwidth 
allocation is done properly, as described in the next subsection (using the M approach). 

Finally, to tolerate a single node failure, one can remove the nodes (instead of just links) along the chosen 
active path first before determining the corresponding backup path. 

Path Establishment and Signaling Packets 

In DPIM, once the active and backup paths are determined, the ingress node sends signaling packets to the 
nodes along the two paths. More specifically, let $A = \{a_i \mid i = 1, 2, \ldots, p\}$ and $B = \{b_j \mid j = 1, 2, \ldots, q\}$ be the 
sets of links along the chosen active and backup paths, respectively. A "connection set-up" packet, which contains 
address information on the ingress and egress nodes as well as the bandwidth requested (i.e., $w$), amongst other 
information, will then be sent to the nodes along the active path to establish the requested connection. This set-up 
process may be carried out in any reasonable distributed manner by reserving $w$ units of bandwidth on each link 
$a_i \in A$, creating a switching/routing entry with an appropriate connection identifier (e.g., a label), and configuring 
the switching fabric (e.g., a cross-connect) at each node along the active path, until the egress node is reached. The 
egress node then sends back an acknowledgment packet (or ACK). 

In addition, a "bandwidth reservation" packet will be sent to the nodes along the chosen backup path. This 
packet will contain similar information to that carried by the "connection set-up" packet. At each node along the 
backup path, similar actions will also be taken except that the switching fabric will not be configured. In addition, 
the amount of bandwidth to be reserved on each link $b_j \in B$ may be less than $w$ due to potential bandwidth sharing. 
This amount depends on the cost estimation method (e.g., DPIM, DPIM-S, DPIM-A, or DPIM-SA) described 
above as well as the bandwidth allocation approach to be used, described next. 

Bandwidth Allocation on Backup Path 

There are two approaches to bandwidth allocation on a backup path. In particular, the information on how 
much bandwidth is to be reserved on each link b_j ∈ B can be determined either by the ingress node or by the node n 
along the backup path for which b_j ∈ h(n). More specifically, in the former case, called Explicit Allocation of 
Estimated Cost (EAEC), the ingress node computes, for each b_j ∈ B, the estimated cost (a maximum taken over all 
a_i ∈ A) appropriately (depending on whether DPIM, DPIM-S, DPIM-A or DPIM-SA is used) and then attaches the 
values, one for each b_j, to the "bandwidth reservation" packet. Upon receiving the bandwidth reservation packet, 
a node n along the backup path allocates the amount of bandwidth specified for an outgoing link b_j ∈ h(n). 

In the latter case, called Hop-by-hop Allocation of Minimum Bandwidth or HAMB (hereafter called the 
M approach for simplicity, where M stands for Minimum), the "bandwidth reservation" packet contains the 
information on the active path and w. Upon receiving this information, each node n that has an outgoing link e ∈ B 
updates the set G(e) and then G_e. Thereafter, the amount of bandwidth to be allocated on link e, denoted by bw, is 
the updated G_e minus the previous G_e if the updated value exceeds the previous one, and 0 otherwise. In addition, 
if bw > 0, then G_e is increased and R_e is reduced by bw, and the updated values are multicast to all ingress nodes 
using either extended LSAs or dedicated signaling protocols. 
Note that only the p entries in G(e) that correspond to links a_i ∈ A, where p is the number of links on the active 
path, need be updated (more specifically, each δ^{e}_{a_i} need be increased by w), and the new value of G_e is 
simply the largest among all the entries in G(e) or, if the old value of G_e is maintained, the largest among that 
and the values of the newly updated p entries. 
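
The per-link processing of the M approach can be sketched as follows. This is an illustrative sketch rather than a definitive implementation: the class LinkState, its field names and the function allocate_backup are hypothetical, and the check against the residual bandwidth is an added assumption.

```python
from dataclasses import dataclass, field


@dataclass
class LinkState:
    """State kept by a node for one outgoing link e (illustrative names)."""
    residual: int                                 # R_e
    backup: int = 0                               # G_e currently allocated
    profile: dict = field(default_factory=dict)   # G(e): active link -> bandwidth to restore on e


def allocate_backup(link, active_links, w):
    """Process a "bandwidth reservation" packet for link e; return bw."""
    # Only the p entries that correspond to links on the active path change.
    updated = {a: link.profile.get(a, 0) + w for a in active_links}

    # The old G_e is maintained, so the new G_e is the largest among the
    # old value and the newly updated entries.
    needed = max([link.backup, *updated.values()])
    bw = needed - link.backup

    # Assumption: a request that does not fit is rejected before any change.
    if bw > link.residual:
        raise ValueError("insufficient residual bandwidth on link")

    link.profile.update(updated)
    link.backup += bw      # G_e increases by bw
    link.residual -= bw    # R_e decreases by bw
    return bw
```

For instance, with hypothetical entries G(e) = {e1: 3, e6: 1} and G_e = 3, a new connection with w = 2 whose active path uses e6 raises the e6 entry to 3, which does not exceed G_e, so bw = 0 and no additional backup bandwidth is allocated.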

The advantage of the M approach is that it achieves better bandwidth sharing than even the best EAEC 
(i.e., EAEC based on DPIM-SA). For example, assume that two connections, from A to D and from C to D, have 
been established as shown in Figure 4 (1) and (2). Consider a new connection from B to D with w=2 which will 
use e_6 and e_7 on the active and backup paths, respectively. Since F_{e_6}=2 and G_{e_7}=3 (prior to the 
establishment of the connection), using EAEC (based on DPIM-SA), one still needs to allocate 1 additional unit of 
backup bandwidth on e_7 as shown in Figure 4(3). However, using the M approach, G_{e_7} is still 3 after 
establishing the connection, so no additional backup bandwidth on e_7 is allocated, as in Figure 4(3'). 

Since G_e is the necessary (i.e., minimum) backup bandwidth needed on link e, hereafter, we will refer to a 
distributed information management scheme that uses the M approach for bandwidth allocation as either 
DPIM-M, DPIM-SM, DPIM-AM or DPIM-SAM, depending on whether DPIM, DPIM-S, DPIM-A or DPIM-SA 
is used for estimating the cost of the paths when determining the paths. When "M" is omitted, the EAEC approach 
is implied. Note that because in any DPIM scheme, the paths are determined without the complete (global) 
δ^{b}_{a} information, DPIM-SAM will still under-perform the SCI scheme, which always finds optimal active and 
backup paths. Due to the lack of complete information, DPIM-SAM is only able to achieve near optimal bandwidth 
sharing in an on-line situation. (It is not designed for the purpose of achieving global optimization via, for 
instance, re-arrangement of backup paths.) 

More on Bandwidth Allocation on an Active Path 

Bandwidth allocation on an active path is a straight-forward matter. However, in either the EAEC or M 
approach, if DPIM-A (or DPIM-SA) is used to estimate the cost when trying to determine active and backup paths 
for each request, after the two paths (Active and Backup) are chosen to satisfy a connection-establishment request, 
a "connection set-up" packet sent to the nodes along the active path will need to carry the information on the 
chosen backup path in addition to w and other addressing information. Upon receiving such information, each 
node n that has an outgoing link e ∈ A updates the set F(e) and then F_e. The updated values of F_e for every e ∈ A 
are then multicast to all ingress nodes along with information such as R_e. 

Note that only the q entries in F(e) that correspond to links b_j ∈ B, where q is the number of links on the backup 
path, need be updated (more specifically, each δ^{b_j}_{e} need be increased by w), and the new value of F_e is 
simply the largest among all the entries in F(e) or, if the old value of F_e is maintained, the largest among that 
and the values of the newly updated q entries. 
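
The update of F(e) and F_e at a node along the active path can be sketched in a few lines. The function name update_active_profile and the dictionary representation of F(e) are assumptions made for illustration only.

```python
def update_active_profile(profile, f_e, backup_links, w):
    """Update F(e) for a newly established connection and return the new F_e.

    profile      -- F(e): maps each backup link b to the bandwidth carried on e
                    that is to be restored on b (mutated in place)
    f_e          -- the old (maintained) value of F_e
    backup_links -- the q links b_j on the chosen backup path
    w            -- bandwidth requested by the connection
    """
    # Only the q entries that correspond to links on the backup path change.
    for b in backup_links:
        profile[b] = profile.get(b, 0) + w

    # Entries only grow here, so the new maximum is either the old F_e
    # or one of the newly updated entries.
    return max([f_e] + [profile[b] for b in backup_links])
```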

Clearly, compared to DPIM or DPIM-S, DPIM-A (or DPIM-SA) requires each node n to maintain the set F(e) 
for each outgoing link e ∈ h(n). In addition, it requires each "connection set-up" packet to carry the backup path 
information, as well as some local computation of F_e. Nevertheless, our performance evaluation results show that 
the benefit of DPIM-A in improving bandwidth sharing (and in determining a better backup as described earlier) is 
quite significant. 

Connection Tear-Down 

When a connection release request arrives, a "connection tear-down" packet and a "bandwidth release" 
packet are sent to the nodes along the active and backup paths, respectively. These packets may carry the 
connection identifier to facilitate the bandwidth release and removal of the switching/routing entry corresponding 
to the connection identifier. As before, the egress will send ACK packets back. 

Bandwidth de-allocation on the links along an active path A is straight-forward unless DPIM-A is used. 
More specifically, if DPIM-A is not used, w units of bandwidth are de-allocated on each link e ∈ A, and the updated 
values of F_e and R_e are multicast to all the ingress nodes. The case where DPIM-A (or DPIM-SA, DPIM-SAM) is 
used will be described at the end of this subsection. 

Although bandwidth de-allocation on the links along a backup path B is not as straight-forward, it 
resembles bandwidth allocation using the M approach. More specifically, to facilitate effective bandwidth 
de-allocation, each "bandwidth release" packet will carry the information on the active path (i.e., the set A) as well 
as w. Upon receiving this information, each node n that has an outgoing link e ∈ B updates the set G(e) and then 
G_e. Thereafter, the amount of bandwidth to be de-allocated on link e is bw, the previous G_e minus the updated 
G_e, with bw ≥ 0. If bw > 0, then G_e changes to the updated value and R_e increases by bw, and the updated values 
are multicast to all ingress nodes. Note that this implies that each node n needs to maintain G_e as well as the set 
G(e) for each link e ∈ h(n) to deal with bandwidth de-allocation, even though such information may seem to be 
redundant for bandwidth allocation (e.g., when using the EAEC 
approach). 
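
The de-allocation step on a backup link mirrors the allocation step of the M approach and can be sketched as follows. The function name release_backup, its arguments and the validation of the entries are illustrative assumptions.

```python
def release_backup(profile, allocated, residual, active_links, w):
    """Process a "bandwidth release" packet for one link e on the backup path.

    profile      -- G(e): maps each active link a to the bandwidth to be
                    restored on e if a fails (mutated in place)
    allocated    -- backup bandwidth currently allocated on e (previous G_e)
    residual     -- R_e before the release
    active_links -- links of the active path A carried in the packet
    w            -- bandwidth of the released connection
    Returns (new_allocated, new_residual, bw).
    """
    # Assumption: reject inconsistent releases before changing any state.
    for a in active_links:
        if profile.get(a, 0) < w:
            raise ValueError(f"no entry of at least {w} for active link {a}")

    # Decrease only the entries that correspond to links on the active path.
    for a in active_links:
        profile[a] -= w
        if profile[a] == 0:
            del profile[a]

    # The updated G_e is the largest remaining entry (0 if none is left).
    needed = max(profile.values(), default=0)
    bw = max(allocated - needed, 0)
    return allocated - bw, residual + bw, bw
```

Because bw is measured against the bandwidth actually allocated, any amount that EAEC allocated beyond the necessary minimum is also returned at release time.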

If DPIM-A (or DPIM-SA) is used, releasing a connection along the active path can be similar to 
establishing a connection along the active path when DPIM-A (or DPIM-SA) is used. Specifically, each 
"connection tear-down" packet will contain the set B, and upon receiving such information, a node n that has an 
outgoing link e ∈ A updates the set F(e) as well as F_e for link e, and then multicasts the updated F_e to all ingress 
nodes. 

Information Distribution and Exchange Methods 

We have assumed that the topological information is exchanged using LSAs as in OSPF. We have also 
described the information to be carried by the signaling packets used to establish and tear down a connection. In 
short, the two bandwidth allocation approaches, EAEC and M, differ little in the amount of information to be 
carried by a "bandwidth reservation" or "bandwidth release" packet. If DPIM-A (or DPIM-SA) is used, more 
information needs to be carried by a "connection set-up" or "connection tear-down" packet, but the amount of 
information is bounded by O(|V|). 



Here, we discuss the methods to exchange information such as F_e, G_e or R_e. As mentioned earlier, one 
method, which we call core-assisted broadcast (or CAB), is to use extended LSAs (or to piggyback the 
information onto existing LSAs). A major advantage of this method is that no new dedicated signaling protocols 
are needed. One major disadvantage is that such information, which is needed by the ingress nodes only, is 
broadcast to all the nodes, which results in unnecessary signaling overhead. Another disadvantage is that the 
frequency at which such information is exchanged has to be tied to the frequency at which other LSAs are 
exchanged. When the frequency is too low relative to the frequency at which connections are set up and 
torn down, ingress nodes may not receive up-to-date information on F_e, G_e or R_e, which will adversely affect 
their decision-making ability. On the other hand, when the frequency is too high, the signaling overhead involved 
in exchanging this information (and other topological information) may become significant. 

To address the deficiencies of the above method, one may use a dedicated signaling protocol that 
multicasts the information to all the ingress nodes whenever it is updated. This multicast can be performed by each 
node (along either the active or backup path) which updates the information. We call such a method Core-Assisted 
Multicast of Individual Update (or CAM-IU). Since each signaling packet contains a more or less fixed amount of 
control information (such as sequence number, time-stamp or error checking/detection codes), one can further 
reduce signaling overhead by collecting the updated information, on either R_{a_i} and F_{a_i} for every link 
a_i ∈ A or R_{b_j} and G_{b_j} for every link b_j ∈ B, in one "updated information" packet, and multicasting that 
packet to all ingress nodes. Such information may be collected in the ACK sent by the egress node to the ingress 
node, and when the ingress node receives the ACK, it constructs an "updated information" packet and multicasts 
the packet to all other ingress nodes. We call this type of method "Edge Direct Multicast of Collected (lump sum) 
Updates" or EDM-CU. 

Note that when EAEC is used in conjunction with DPIM or DPIM-S, the amount of bandwidth to be 
allocated on the active and backup paths in response to a connection establishment request is determined by the 
ingress node. The ingress node can then update F_e, G_e and R_e for all e ∈ A ∪ B, and construct such an updated 
information packet. We call such a method EDM-V (where V stands for value). Also, in such a case, the ingress 
node may multicast just a copy of the connection establishment request to all other ingress nodes, which can then 
compute the active and backup paths (but will not send out signaling packets), and update F_e, G_e and R_e by 
themselves. We call such a method EDM-R (where R stands for request). To avoid duplicate path computation at 
all ingress nodes, the ingress node will compute the active and backup paths and send the path information to all 
other ingress nodes, which update F_e, G_e and R_e. We call this alternative EDM-P (where P stands for path). Note 
that in either EDM-R or EDM-P, each ingress node will discard the computed/received path information after 
updating F_e, G_e and R_e. 

Note also that EDM-V, EDM-P and EDM-R do not work when a connection tear-down request is 
received, when DPIM-A or DPIM-SA is used, or when the M approach is used to allocate bandwidth (instead of 
EAEC), because in these situations, none of the ingress nodes has enough information to be able to compute the 
updated F_e, G_e and R_e based on just the request and/or the paths (therefore, one needs to use CAM-IU or 
EDM-CU). 

Conflict Resolution 

As in almost all distributed implementations, conflicts among multiple signaling packets may arise due to 
the so-called race conditions. More specifically, two or more ingress nodes may send out "connection set-up" (or 
"bandwidth reservation") packets at about the same time after each receives a connection establishment request. 
Although each ingress node may have the most up-to-date information needed at the time it computes the paths for 
the request it received, multiple ingress nodes will make decisions at about the same time independently of the 
other ingress nodes, and hence, compete for bandwidth on the same link. 

If multiple signaling packets request bandwidth on the same link, and the residual bandwidth on the 
link is insufficient to satisfy all requests, then one or more late-arriving, low-priority, or randomly chosen 
signaling packets will be dropped. For each such dropped request, a negative acknowledgment (or NAK) will be 
sent back to the corresponding ingress node. In addition, any prior modifications made as a result of processing 
the dropped packet will be undone. The ingress node, upon receiving the NAK, may then choose to reject the 
connection establishment request, or wait until it receives updated information (if any) before trying a different 
active and/or backup path to satisfy the request. Note that if adaptive routing (hop-by-hop, or partially explicit 
routing) is used, the node where signaling packets compete for bandwidth of an outgoing link may choose a 
different outgoing link to route some packets, instead of dropping them (and sending NAKs to their ingress nodes 
afterwards). 
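
One possible way for a node to decide which competing requests to drop is sketched below. The policy shown (higher priority first, then earlier arrival) is only one of the options mentioned above and is an assumption of this example, as are the function name resolve_conflict and the request tuple layout.

```python
def resolve_conflict(residual, requests):
    """Decide which competing requests for one link are granted.

    residual -- residual bandwidth on the contended link
    requests -- list of (request_id, priority, bandwidth) in arrival order;
                a larger priority value means a more important request
    Returns (granted_ids, nak_ids, remaining_residual).
    """
    # Assumed policy: serve higher priority first, break ties by arrival order.
    order = sorted(range(len(requests)), key=lambda i: (-requests[i][1], i))

    granted, naks = [], []
    for i in order:
        request_id, _, bandwidth = requests[i]
        if bandwidth <= residual:
            residual -= bandwidth
            granted.append(request_id)
        else:
            # The packet is dropped; a NAK goes back to its ingress node.
            naks.append(request_id)
    return granted, naks, residual
```

The ingress node of each request listed in the NAK result would then either reject the connection establishment request or retry with a different path, as described above.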



Extensions to Multiple Classes of Connections 

We now describe how to accommodate two additional classes of connections in terms of their tolerance to 
faults: unprotected and pre-emptable. An unprotected connection does not need a backup path, so if (and only if) 
the active path is broken due to a failure, traffic carried by the unprotected connection will be lost. A pre-emptable 
connection is unprotected, and in addition, carries low-priority traffic such that even if a failure does not break the 
connection itself, it may be pre-empted because its bandwidth is taken away by the backup paths corresponding to 
those (protected) active connections that are broken due to the failure. 

The definitions above imply that an unprotected connection needs a dedicated amount of bandwidth (just 
as an active path), and that a pre-emptable connection can share bandwidth with any backup paths (but not with 
other pre-emptable connections). 

Let U_e and P_e denote the sum of the bandwidth required by unprotected and pre-emptable connections, 
respectively, which use link e. As with F_e, G_e and R_e, each node n (edge or core) maintains U_e and P_e for each 
link e ∈ h(n). In addition, each ingress node (or a controller) maintains U_e and P_e for all links e ∈ E. 

Accordingly, define G_e(P) = max{G_e, P_e} and R_e(U) = C_e - F_e - G_e(P) - U_e. When handling a request for a 
protected connection, one may follow the same procedure outlined above for DPIM and its variations after 
replacing R_e with R_e(U) and G_e with G_e(P) in backup cost determination, path determination, and bandwidth 
allocation/de-allocation (though G_e still needs to be updated and maintained in addition to P_e and G_e(P)). 

One can deal with an unprotected connection request in much the same way as a protected connection 
with the exception that there is no corresponding backup path (and that U_e, instead of F_e, will be updated 
accordingly). 

Finally, one can deal with a request to establish a pre-emptable connection requiring w units of bandwidth 
as follows. First, for every link e, one calculates bw = P_e + w - G_e(P). One then assigns max{bw, 0} as the cost 
of link e in the graph N representing the network, and finds a cheapest path, along which the pre-emptable 
connection is then established in much the same way as an unprotected connection (with the exception that P_e 
and G_e(P) will be updated accordingly). 
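
The link cost for a pre-emptable connection and the resulting path search can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the dictionary layout of the per-link values, and the use of Dijkstra's algorithm as the cheapest-path algorithm are choices made for illustration, and the check that a link has enough residual bandwidth is omitted.

```python
import heapq


def preemptable_cost(p_e, g_e, w):
    """Cost max{bw, 0} of link e, where bw = P_e + w - G_e(P)."""
    g_e_p = max(g_e, p_e)          # G_e(P) = max{G_e, P_e}
    return max(p_e + w - g_e_p, 0)


def cheapest_path(links, src, dst, w):
    """Find a cheapest path for a pre-emptable connection of bandwidth w.

    links -- dict: (u, v) -> {"P": P_e, "G": G_e} for each directed link
    Returns (total_cost, [nodes]) or None if dst is unreachable.
    """
    adjacency = {}
    for (u, v), state in links.items():
        cost = preemptable_cost(state["P"], state["G"], w)
        adjacency.setdefault(u, []).append((v, cost))

    # Dijkstra's algorithm; all costs are non-negative by construction.
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost in adjacency.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))

    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return dist[dst], path[::-1]
```

A cost of 0 on a link means the pre-emptable connection fits entirely within bandwidth already reserved for backup paths, which is why such links are preferred.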

Application and Extension to Other Distributed and Centralized Schemes 

All the DPIM schemes described can be implemented by using just one or more controllers to determine 
the paths (instead of the ingress nodes). Similarly, one can place additional controllers at some strategically 
located core nodes, in addition to the ingress nodes, to determine the paths. This is feasible especially when OSPF 
is used to distribute the topology information as well as additional information (such as F_e, G_e and R_e). This will 
facilitate partially explicit routing through those core nodes with an attached controller. More specifically, each 
connection can be regarded as having one or more segments, whose two end nodes are equipped with co-located 
controllers. Hence, the controller at the starting end of each segment can then find a backup segment by using the 
proposed DPIM scheme or its variations. 

One can also extend the methods and techniques described previously to implement, under distributed 
control, a scheme based on either NS or SCI. While extension to a distributed scheme based on NS is fairly 
straight-forward, implementing a scheme based on SCI, which we call distributed complete information 
management or DCIM, by maintaining δ^{b}_{a} for all links a and b (for a total of |E|^2 values), becomes similar 
to the SR/SSR scheme described in the prior art. The difference, however, is that while in SR/SSR, information on 
δ^{b}_{a} is exchanged via LSAs (i.e., using CAB), we propose to use a dedicated signaling protocol as described 
earlier (e.g., CAM-IU, or any EDM-based method) to multicast the updated δ^{b}_{a} to all ingress nodes to achieve 
a variety of trade-offs between path computational overhead, signaling overhead, and timeliness of the 
information updates. 

Finally, while DPIM already has a corresponding centralized control implementation (which is SPI), one 
can also implement, under centralized control, schemes corresponding to other variations of DPIM, such as 
DPIM-S, DPIM-A and DPIM-SA. 

It will be appreciated that the instant specification, drawings and claims are set forth by way of illustration 
and not limitation, and that various modifications and changes may be made without departing from the spirit and 
scope of the present invention. 

What we claim are: 

1. A method to establish and release network connections with guaranteed bandwidth for networks 
under distributed control, wherein: 

each ingress node acts as a distributed controller that performs explicit routing of network packets, each of 
said ingress nodes maintaining only partial information on existing paths, said partial information on existing 
paths comprising a total amount of bandwidth on every link that is currently reserved for all backup paths, and 
the residual bandwidth on every link. 

2. The method of claim 1, wherein said partial information on existing paths further comprises a 
total amount of bandwidth on every link dedicated to all active connections. 

3. The method of claim 1 or 2, wherein said network connections are protected against single link or 
node failures. 

4. The method of claim 1 or 2, wherein said network connections are unprotected against single link 
or node failures. 

5. The method of claim 1 or 2, wherein said network connections are pre-emptable by a protected 
connection upon a link or node failure. 

6. The method of claim 3, further comprising the steps of 

determining routes for an active path and a backup path by a distributed controller, said backup path 
being link or node disjoint with said active path, 

allocating or de-allocating bandwidth along said active path and said backup path using distributed 
signaling, and allowing bandwidth sharing among backup paths, and 

updating and exchanging partial and aggregated information between distributed controllers as a result of 
establishing or releasing a connection. 

7. The method of claim 6, wherein the step of determining routes for an active path and a backup 
path utilizes methods based on Integer Linear Programming to minimize the sum of the bandwidth consumed 
by each pair of active path and backup path. 

8. The method of claim 7, wherein the bandwidth consumed by the backup path is estimated based 
on the partial information available, 

each link whose estimated backup bandwidth is 0 is assigned a small non-zero cost to reduce the backup 
length and thus the recovery time, and 

the component in the objective cost function for the backup path is adjusted down by a fraction to reduce 
the total bandwidth consumption by all the connections. 

9. The method of claim 6, wherein the step of determining routes for an active path and a backup 
path utilizes an algorithm to find a shortest pair of paths after assigning each link a cost, said cost being w if 
said link has a residue bandwidth that is no less than w, and infinity otherwise (which logically removes the link). 

10. The method of claim 6, wherein the step of determining routes for an active path and a backup 
path utilizes an algorithm that finds an active path first, comprising the steps of: 

determining an active path using any well-known shortest path algorithm, after logically removing the links 
whose residue bandwidth is less than w, and assigning each of the remaining links a cost that includes the 
bandwidth required by the active path plus any potential amount of additional bandwidth required by the 
yet-to-be-determined backup path, 

said potential amount of additional bandwidth being proportional to the maximum traffic carried on a 
given link a to be restored on any other link in case of failure of said given link and the bandwidth requested by the 
connection, 

once an active path is determined, all the links along the active path are logically removed, the 
corresponding backup path is found similarly using any well-known shortest path algorithm after 

each link is assigned either the requested bandwidth or an estimated cost if the cost is no greater than the 
residue bandwidth of the link, or infinity if otherwise. 

11. The method of claim 1, wherein signaling packets are sent along the active path and backup path 
respectively, 

said signaling packets sent along the active path contain the set of links along the backup path, 
said signaling packets sent along the backup path contain the set of links along the active path, and 
each node along the backup path allocates a minimum or de-allocates a maximum amount of bandwidth based 
on the locally stored information at each node, independent of the estimated cost. 

12. The method of claim 2, wherein each distributed controller at the edge maintains, for every link 
in the network, the amount of bandwidth allocated for backup paths, as well as the amount of residue 
bandwidth available. 

13. The method of claim 2, wherein each distributed controller at the edge maintains, in addition, the 
maximum amount of traffic carried that needs to be restored on any given link for every link in the 
network. 

14. The method of claim 2, wherein each distributed controller at a core or edge node maintains 
partial aggregated information on every local link, including the amount of bandwidth on every other 
link to be restored on the local link, and the amount of bandwidth carried on the local link that is to be 
restored on every other link. 

15. The method of Claims 12 and 13, further comprising methods to exchange the updated information 
among the edge and core controllers, wherein 

each core node along a newly established or released active path and backup path will multicast to all edge 
controllers with locally updated information. 

16. The method of Claim 15, further comprising methods to exchange the updated information 
among the edge and core controllers, wherein 

signaling packets can collect the updated information along their ways, then either the destination 
receiving the signaling packets or the source receiving the corresponding acknowledgment for the signaling 
packets can multicast the updated information to all other edge controllers, 

embedding the updated information in standard Link State Advertisement packets used by the 
Internet Protocol, and 

broadcasting said Link State Advertisement packets to all other nodes at pre-determined 
intervals. 